PM2.5 refers to airborne particles with aerodynamic diameters of 2.5 micrometers and smaller. These particles are small enough to penetrate deep into the lungs and enter the circulatory system, and can therefore have more serious health impacts than larger particles such as PM10. Many epidemiological studies have associated elevated PM2.5 concentrations with adverse health effects, including cardiovascular and respiratory diseases and lung cancer. PM2.5 is currently one of the major air pollutants in many countries and regions, and it is a particularly serious problem in urban areas. In addition, PM2.5 leads to a series of environmental effects, including visibility impairment, environmental damage and material damage.
PM2.5 is in fact a complex mixture of many chemical species, water droplets and even microbes attached to the particles. The composition of PM2.5 (i.e. the proportions of these components) is closely related to its physicochemical properties and toxicity, so it has been a major research focus for understanding the effects of PM2.5 on human health. The composition is also a fingerprint of pollution sources, and scientists have used it to trace and quantify the major emission sources.
The United States Environmental Protection Agency (EPA) has established multiple nationwide monitoring networks[1] to measure PM2.5 concentration and its components in support of air quality management, air quality research and public health studies in the United States. However, because of cost and the limitations of analytical techniques, EPA routinely measures only a subset of the major components: organic carbon (OC), elemental carbon (EC), inorganic ions (sulfate, nitrate, and ammonium), and major elements (mineral elements, salt, and others). Since oxygen (O) and hydrogen (H) are not directly measured in these networks, researchers have explored various equations to account for their presence and thereby approximate the gravimetric mass of PM2.5. These equations are predominantly linear, with the component concentrations as independent variables and the PM2.5 concentration as the dependent variable[2]. In air quality research, the linear coefficients are usually inferred from both the likely chemical forms of the components (e.g. iron could be present in nature as Fe3O4) and the measured data (e.g. by regression).
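As a sketch of the general linear form described above (the symbols are notation introduced here for illustration and are not taken from the cited studies), the reconstructed mass can be written as:

```latex
\widehat{\mathrm{PM}}_{2.5} = \beta_0 + \sum_{i=1}^{p} \beta_i \, C_i
```

Here C_i is the measured concentration of component i and beta_i is its multiplier. When a multiplier is derived from an assumed chemical form, it is the ratio of the compound mass to the mass of the measured element. For instance, if iron were present entirely as Fe3O4, the multiplier would be:

```latex
\frac{M(\mathrm{Fe_3O_4})}{3\,M(\mathrm{Fe})} \approx \frac{231.53}{167.54} \approx 1.38
```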
The objective of this project is to use the measurements (both PM2.5 and its components) collected by the US nationwide Chemical Speciation Network (CSN) to explore regression models that predict PM2.5 concentration from the composition of PM2.5. The project also evaluates whether purely mathematical approaches can reconstruct or predict PM2.5 concentrations from all or a subset of the component concentrations.
The raw data files were retrieved from the US EPA air quality monitoring data repository[3]. The 24-hour aggregated PM2.5 and component concentrations were collected as part of the US Chemical Speciation Network (CSN), which comprises over 100 monitoring stations across the US. Each air monitoring station measures water-soluble ions (e.g. nitrate and sulfate), organic carbon (OC), elemental carbon (EC) and elements in PM2.5, along with meteorological conditions (temperature, wind direction, wind speed, relative humidity, etc.). Samples are collected once every 3 days or once every 6 days. For this project, we selected data collected at a monitoring station in New York City from 2011 to 2014.
The data collection and cleaning procedure included the following steps:
- retrieve the raw data files for PM2.5 and its components for the years 2011 to 2014
- identify the station and sampler codes and extract the data for New York City from the raw data files
- keep only the subset of variables relevant to this analysis
- examine the missing values and keep only days that have data for all the variables
- export the formatted and cleaned dataset to a new set of csv files. Data for 2011 and 2012 (trainingset.csv) were used to train the regression models and data for 2013 and 2014 (testingset.csv) were used to test them.
The cleaning and formatting process dramatically reduced the size of the datasets. For example, the raw files are over 700 MB and contain over 2 million rows, whereas the compiled and cleaned datasets used for analysis are typically around 30 KB and contain only about 160 rows.
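The cleaning steps listed above can be sketched in base R as follows. This is a minimal illustration on a tiny made-up table: the column names, the site code `NYC01` and the values are hypothetical, and the real EPA files use a different layout.

```r
# Hypothetical raw records; column names, site code and values are illustrative only.
raw <- data.frame(
  site    = c("NYC01", "NYC01", "NYC01", "OTHER", "NYC01"),
  date    = as.Date(c("2011-01-03", "2012-06-10", "2013-02-01",
                      "2013-02-01", "2014-07-15")),
  OC      = c(2.9, 3.1, NA, 1.0, 2.5),
  Sulfate = c(2.6, 3.4, 2.2, 0.9, 1.8),
  PM2.5   = c(16.6, 18.9, 12.0, 5.0, 11.1),
  unused  = 1:5
)

# Keep one station and only the variables needed for the analysis.
keep_vars <- c("date", "OC", "Sulfate", "PM2.5")
nyc <- raw[raw$site == "NYC01", keep_vars]

# Keep only days with data for all variables.
nyc <- nyc[complete.cases(nyc), ]

# Split by year: 2011-2012 for training, 2013-2014 for testing.
yr <- as.integer(format(nyc$date, "%Y"))
training <- nyc[yr %in% 2011:2012, ]
testing  <- nyc[yr %in% 2013:2014, ]

# Export the cleaned sets (written to a temporary directory in this sketch).
train_path <- file.path(tempdir(), "trainingset.csv")
test_path  <- file.path(tempdir(), "testingset.csv")
write.csv(training, train_path, row.names = FALSE)
write.csv(testing,  test_path,  row.names = FALSE)
```

Splitting by calendar year rather than at random keeps the test set strictly later in time than the training set, which mirrors how the model would be used for prediction.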
Before evaluating the regression models, we performed an exploratory analysis of the training dataset to understand the characteristics of the variables.
## 'data.frame': 166 obs. of 38 variables:
## $ Antimony. : num 0 0 0.011 0.033 0.009 0.036 0 0.005 0.016 0.021 ...
## $ Arsenic. : num 0.002 0 0 0 0.0026 0 0.001 0 0.002 0 ...
## $ Aluminum. : num 0.031 0.028 0.001 0.026 0.032 0.031 0.013 0.026 0.007 0.043 ...
## $ Barium. : num 0 0 0.009 0 0 0 0 0.009 0 0 ...
## $ Bromine. : num 0.0025 0.0032 0.0041 0.0029 0.0035 0.0036 0.0043 0.0075 0.0026 0.0061 ...
## $ Cadmium. : num 0 0 0.011 0 0 0 0 0.019 0.012 0.002 ...
## $ Calcium. : num 0.0587 0.0426 0.0437 0.0621 0.123 0.0424 0.0631 0.111 0.0571 0.135 ...
## $ Chromium. : num 0.001 0.001 0.002 0.002 0.002 0 0.001 0 0 0 ...
## $ Cobalt. : num 0.001 0.0023 0.001 0.0016 0.001 0 0.001 0 0.001 0.001 ...
## $ Copper. : num 0.002 0.003 0.007 0.0032 0.0395 0.001 0.008 0.009 0 0.0071 ...
## $ Chlorine. : num 0.0497 0.03 0.006 0.0301 0.109 0.008 0.063 0.044 0.106 0.16 ...
## $ Cerium. : num 0 0 0 0 0 0 0 0 0.002 0 ...
## $ Cesium. : num 0 0 0.002 0.004 0 0 0 0.001 0 0 ...
## $ Iron. : num 0.105 0.0719 0.0713 0.214 0.136 0.0869 0.115 0.282 0.0641 0.2 ...
## $ Lead. : num 0.002 0.004 0.005 0.0063 0.0102 0.001 0.0077 0.003 0 0.0028 ...
## $ Indium. : num 0.013 0.007 0.012 0.011 0 0 0 0.015 0.004 0 ...
## $ Manganese. : num 0.001 0.0028 0.0034 0.0027 0.0043 0.001 0.0049 0.0065 0 0.005 ...
## $ Nickel. : num 0.0082 0.0045 0.0063 0.008 0.014 0.004 0.0051 0.0062 0.0043 0.0157 ...
## $ Magnesium. : num 0 0 0.024 0 0 0 0 0 0 0 ...
## $ Phosphorus.: num 0 0 0.003 0 0.009 0 0 0 0.005 0.016 ...
## $ Selenium. : num 0 0 0.001 0 0.002 0 0.001 0.001 0.001 0 ...
## $ Tin. : num 0 0.03 0 0 0 0 0 0.011 0.013 0 ...
## $ Titanium. : num 0 0.003 0.003 0.002 0 0 0.001 0.0088 0 0.004 ...
## $ Vanadium. : num 0.002 0 0.0064 0.001 0.0132 0.001 0.006 0.0116 0.001 0.0061 ...
## $ Silicon. : num 0.029 0.02 0.0355 0.032 0.044 0.023 0.086 0.0943 0.0378 0.0632 ...
## $ Silver. : num 0 0 0.016 0 0 0 0 0.011 0.011 0.008 ...
## $ Zinc. : num 0.033 0.0287 0.031 0.0416 0.0771 0.0175 0.0246 0.032 0.011 0.0661 ...
## $ Strontium. : num 0.001 0.001 0.001 0 0.001 0 0 0 0.001 0 ...
## $ Rubidium. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Zirconium. : num 0 0.0064 0 0 0 0 0 0.004 0.003 0 ...
## $ Ammonium : num 2.06 1.32 2.92 1.68 1.74 1.32 3.16 2.88 0.295 1.3 ...
## $ Sodium : num 0.05 0.07 0.07 0.09 0.08 0.04 0.07 0.18 0.224 0.196 ...
## $ Potassium : num 0.074 0.079 0.083 0.058 0.074 0 0.083 0.134 0.034 0.064 ...
## $ Nitrate : num 4.34 2.35 5.24 2.9 2.43 1.38 5.99 4.82 0.554 2.68 ...
## $ OC : num 2.94 2.77 3.35 2.5 3.47 2.05 3.48 6.49 2.44 3.47 ...
## $ EC : num 1.4 1.16 1.2 1.23 2.08 1.07 1.17 3.34 0.525 1.55 ...
## $ Sulfate. : num 2.58 2.33 3.42 2.65 3.35 2.77 2.79 3.01 0.834 1.39 ...
## $ PM2.5 : num 16.6 12.7 18.9 13.7 16.7 9.7 19.7 27 6.2 13.9 ...
## [1] "Summary Statistics of the training dataset"
## Antimony. Arsenic. Aluminum.
## Min. :0.000000 Min. :0.0000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.0000000 1st Qu.:0.00300
## Median :0.000000 Median :0.0000000 Median :0.01500
## Mean :0.007765 Mean :0.0004341 Mean :0.02096
## 3rd Qu.:0.011750 3rd Qu.:0.0010000 3rd Qu.:0.03075
## Max. :0.072000 Max. :0.0040000 Max. :0.12500
## Barium. Bromine. Cadmium. Calcium.
## Min. :0.000000 Min. :0.000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.002000 1st Qu.:0.00000 1st Qu.:0.03335
## Median :0.000000 Median :0.002800 Median :0.00000 Median :0.04780
## Mean :0.001121 Mean :0.003106 Mean :0.00194 Mean :0.06319
## 3rd Qu.:0.000000 3rd Qu.:0.003975 3rd Qu.:0.00075 3rd Qu.:0.06298
## Max. :0.018000 Max. :0.018300 Max. :0.02200 Max. :1.50000
## Chromium. Cobalt. Copper.
## Min. :0.000000 Min. :0.0000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.0000000 1st Qu.:0.002600
## Median :0.001000 Median :0.0010000 Median :0.004200
## Mean :0.002037 Mean :0.0007235 Mean :0.004973
## 3rd Qu.:0.002000 3rd Qu.:0.0010000 3rd Qu.:0.006675
## Max. :0.042700 Max. :0.0029000 Max. :0.039500
## Chlorine. Cerium. Cesium. Iron.
## Min. :0.00000 Min. :0.0000000 Min. :0.00000 Min. :0.00210
## 1st Qu.:0.00400 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.08712
## Median :0.00985 Median :0.0000000 Median :0.00000 Median :0.11600
## Mean :0.03372 Mean :0.0001084 Mean :0.00112 Mean :0.13101
## 3rd Qu.:0.02210 3rd Qu.:0.0000000 3rd Qu.:0.00100 3rd Qu.:0.16475
## Max. :1.56000 Max. :0.0030000 Max. :0.01000 Max. :0.32100
## Lead. Indium. Manganese.
## Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.001000
## Median :0.002000 Median :0.000000 Median :0.002100
## Mean :0.002072 Mean :0.003187 Mean :0.002319
## 3rd Qu.:0.003000 3rd Qu.:0.001750 3rd Qu.:0.003300
## Max. :0.012400 Max. :0.041000 Max. :0.008000
## Nickel. Magnesium. Phosphorus.
## Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.002225 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.003250 Median :0.000000 Median :0.000000
## Mean :0.004195 Mean :0.006387 Mean :0.001157
## 3rd Qu.:0.005000 3rd Qu.:0.008000 3rd Qu.:0.000000
## Max. :0.015700 Max. :0.122000 Max. :0.072000
## Selenium. Tin. Titanium.
## Min. :0.0000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.0000000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.0000000 Median :0.000000 Median :0.002000
## Mean :0.0004181 Mean :0.002904 Mean :0.002229
## 3rd Qu.:0.0010000 3rd Qu.:0.000000 3rd Qu.:0.003000
## Max. :0.0032000 Max. :0.046000 Max. :0.009700
## Vanadium. Silicon. Silver. Zinc.
## Min. :0.000000 Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.001000 1st Qu.:0.03000 1st Qu.:0.000000 1st Qu.:0.01100
## Median :0.003000 Median :0.04610 Median :0.000000 Median :0.01760
## Mean :0.004282 Mean :0.06207 Mean :0.001619 Mean :0.02242
## 3rd Qu.:0.006100 3rd Qu.:0.06928 3rd Qu.:0.000000 3rd Qu.:0.02675
## Max. :0.025300 Max. :1.36000 Max. :0.018000 Max. :0.13000
## Strontium. Rubidium. Zirconium.
## Min. :0.0000000 Min. :0.0000000 Min. :0.0000000
## 1st Qu.:0.0000000 1st Qu.:0.0000000 1st Qu.:0.0000000
## Median :0.0000000 Median :0.0000000 Median :0.0000000
## Mean :0.0006488 Mean :0.0001723 Mean :0.0007681
## 3rd Qu.:0.0010000 3rd Qu.:0.0000000 3rd Qu.:0.0000000
## Max. :0.0110000 Max. :0.0020000 Max. :0.0093000
## Ammonium Sodium Potassium Nitrate
## Min. :0.0000 Min. :0.01630 Min. :0.00000 Min. :0.0952
## 1st Qu.:0.4482 1st Qu.:0.04000 1st Qu.:0.00915 1st Qu.:0.6110
## Median :0.7895 Median :0.07000 Median :0.02700 Median :1.1800
## Mean :1.0200 Mean :0.09725 Mean :0.03744 Mean :1.6760
## 3rd Qu.:1.3200 3rd Qu.:0.11300 3rd Qu.:0.05058 3rd Qu.:2.0075
## Max. :3.9400 Max. :1.31000 Max. :0.30400 Max. :8.1200
## OC EC Sulfate. PM2.5
## Min. :0.029 Min. :0.0000 Min. :0.058 Min. : 2.900
## 1st Qu.:2.042 1st Qu.:0.7615 1st Qu.:1.250 1st Qu.: 7.525
## Median :2.535 Median :1.0700 Median :1.820 Median :10.350
## Mean :2.820 Mean :1.2022 Mean :2.198 Mean :11.777
## 3rd Qu.:3.365 3rd Qu.:1.5100 3rd Qu.:2.725 3rd Qu.:14.575
## Max. :6.490 Max. :3.6600 Max. :9.600 Max. :29.500
## [1] "Summary of standard deviations of each variable"
## Antimony. Arsenic. Aluminum. Barium. Bromine. Cadmium.
## 1 0.01351449 0.0008385659 0.02242708 0.002729698 0.002107611 0.00437167
## Calcium. Chromium. Cobalt. Copper. Chlorine. Cerium.
## 1 0.1182131 0.003887062 0.0007453223 0.004054483 0.1329741 0.0004267438
## Cesium. Iron. Lead. Indium. Manganese. Nickel.
## 1 0.002201364 0.06011016 0.002371596 0.006902442 0.001750726 0.003122156
## Magnesium. Phosphorus. Selenium. Tin. Titanium. Vanadium.
## 1 0.01368735 0.006294744 0.0006866951 0.007416794 0.002081085 0.004761376
## Silicon. Silver. Zinc. Strontium. Rubidium. Zirconium.
## 1 0.1079582 0.00388983 0.01751811 0.001357482 0.0004165699 0.001863884
## Ammonium Sodium Potassium Nitrate OC EC Sulfate.
## 1 0.8129256 0.1230461 0.04183482 1.531039 1.076872 0.6116432 1.383489
## PM2.5
## 1 5.736485
We also visualized the relative contribution of each component to the total PM2.5 concentration in a bar chart. The dataset contains 37 components measured and reported by the EPA (38 variables including PM2.5 itself). The bar chart shows that organic carbon, sulfate, nitrate, elemental carbon and ammonium account for about 75% of total PM2.5. The other species account for only about one quarter, with many elements contributing a negligible amount of PM2.5 mass.
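As a rough check of the stated share, the training-set means printed in the summary statistics above can be divided by the mean PM2.5 concentration. This is a sketch that uses a ratio of means, which is an assumption; the bar chart may have been computed differently.

```r
# Mean concentrations copied from the summary statistics of the training dataset.
means <- c(OC = 2.820, Sulfate = 2.198, Nitrate = 1.676,
           EC = 1.2022, Ammonium = 1.0200)
pm25_mean <- 11.777

# Share of mean PM2.5 mass attributed to each major component.
share <- means / pm25_mean
print(round(100 * share, 1))

# Combined share of the five major components (about three quarters).
total_share <- sum(share)
print(round(100 * total_share, 1))
```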
The time series of PM2.5 and selected key components show seasonal variation. For example, PM2.5 concentration peaks in both the summer and winter months. Sulfate tends to be more abundant in the summer, whereas nitrate concentration peaks primarily during the cold seasons.
Further investigation of the distributions of these variables in the training dataset suggests that many elements are present at trace levels and are frequently reported as zero because their concentrations are below the detection limit. The concentrations of the other major components, as well as PM2.5, appear to follow lognormal-like distributions.
We used subset selection to search for the best subset of predictors. Forward, backward and hybrid methods were explored in this analysis.
For example, the following graphs visualize the results from the forward subsetting method.
## [1] "Forward subsetting"
The backward and hybrid methods generated similar outputs and are therefore not shown in this report. The following table compares the number of variables in the best-fit model by method and selection criterion. The methods are generally consistent with one another; the main disagreement on the number of variables comes from the choice of criterion. For example, with BIC as the criterion, the best-fit model reduces the total number of predictors to 8 or 9.
## [1] "Comparison of the number of variables in the best-fit model"
| method | Cp | BIC | adjr2 |
|---|---|---|---|
| forward | 16 | 9 | 18 |
| backward | 15 | 8 | 17 |
| hybrid | 15 | 8 | 17 |
Because evaluating every “best-fit” model suggested by our analysis would be impractical, we chose the hybrid model with 8 variables as an example and further evaluated its performance. This model identifies aluminum, calcium, vanadium, zirconium, nitrate, OC, EC and sulfate as the primary predictors[4].
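Forward and hybrid selection with BIC as the criterion can be sketched with base R's `step()` function. This is a minimal illustration on synthetic data, not necessarily the routine that produced the table above; the variable names and coefficients are made up.

```r
set.seed(1)
n <- 120

# Synthetic predictors: four informative components and two pure-noise columns.
X <- as.data.frame(matrix(rnorm(n * 6), n, 6))
names(X) <- c("OC", "EC", "Sulfate", "Nitrate", "Noise1", "Noise2")
X$PM2.5 <- 1 + 1.8 * X$OC + 1.8 * X$EC + 1.7 * X$Sulfate +
  0.8 * X$Nitrate + rnorm(n, sd = 0.5)

null_fit <- lm(PM2.5 ~ 1, data = X)   # intercept-only starting point
full_fit <- lm(PM2.5 ~ ., data = X)   # largest model considered

# k = log(n) makes step() use the BIC penalty instead of the default AIC.
search_scope <- list(lower = ~ 1, upper = formula(full_fit))
fwd <- step(null_fit, scope = search_scope,
            direction = "forward", k = log(n), trace = 0)
hyb <- step(null_fit, scope = search_scope,
            direction = "both", k = log(n), trace = 0)

print(names(coef(fwd)))
print(names(coef(hyb)))
```

Because BIC penalizes each extra term by log(n), it tends to select smaller models than Cp or adjusted R2, which is consistent with the pattern in the comparison table.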
We first fitted an ordinary linear model with all components as predictors using the lm function.
##
## Call:
## lm(formula = PM2.5 ~ ., data = training.clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5467 -0.9236 -0.0770 0.7302 6.9903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05432 0.72262 -0.075 0.940201
## Antimony. 0.10231 13.18986 0.008 0.993823
## Arsenic. 79.09069 223.09004 0.355 0.723530
## Aluminum. -18.70193 10.31103 -1.814 0.072054 .
## Barium. 52.10580 76.94355 0.677 0.499504
## Bromine. -36.94899 107.40413 -0.344 0.731397
## Cadmium. 21.36963 48.87916 0.437 0.662708
## Calcium. 6.75570 8.83898 0.764 0.446091
## Chromium. -54.22773 55.28679 -0.981 0.328520
## Cobalt. -452.94674 258.82612 -1.750 0.082514 .
## Copper. 24.94005 59.57917 0.419 0.676207
## Chlorine. 2.88164 2.13684 1.349 0.179863
## Cerium. -297.94003 398.75607 -0.747 0.456329
## Cesium. 109.38041 77.05210 1.420 0.158165
## Iron. -6.07475 4.64218 -1.309 0.193016
## Lead. -43.39213 85.14568 -0.510 0.611194
## Indium. -12.94100 25.28986 -0.512 0.609738
## Manganese. 266.42961 132.33635 2.013 0.046183 *
## Nickel. -55.03376 114.02669 -0.483 0.630176
## Magnesium. -9.71286 19.10295 -0.508 0.612014
## Phosphorus. -3.06297 45.09807 -0.068 0.945957
## Selenium. -0.41866 271.13067 -0.002 0.998770
## Tin. -36.47231 22.65694 -1.610 0.109913
## Titanium. -80.04393 92.65108 -0.864 0.389243
## Vanadium. -117.99091 52.34268 -2.254 0.025883 *
## Silicon. 1.02088 9.62033 0.106 0.915656
## Silver. -52.72089 50.85873 -1.037 0.301870
## Zinc. 13.72810 24.55020 0.559 0.577013
## Strontium. 38.36922 126.47089 0.303 0.762090
## Rubidium. 180.64266 445.45261 0.406 0.685768
## Zirconium. 474.40053 90.49812 5.242 6.36e-07 ***
## Ammonium 0.39389 1.23842 0.318 0.750961
## Sodium 0.06818 2.64351 0.026 0.979465
## Potassium -0.36520 4.49301 -0.081 0.935345
## Nitrate 0.80154 0.41880 1.914 0.057864 .
## OC 1.76803 0.33151 5.333 4.23e-07 ***
## EC 1.82927 0.63687 2.872 0.004772 **
## Sulfate. 1.66945 0.45590 3.662 0.000365 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.897 on 128 degrees of freedom
## Multiple R-squared: 0.9152, Adjusted R-squared: 0.8907
## F-statistic: 37.34 on 37 and 128 DF, p-value: < 2.2e-16
A first look at the results indicates that the linear regression is far from ideal.
1) Many estimated coefficients are negative. To make practical sense, all coefficients should be positive, because the independent variables are components of PM2.5 and represent concentrations in the air.
2) Many coefficient estimates have large standard errors.
3) A large portion of the p-values are fairly large, suggesting that the corresponding coefficients are NOT statistically distinguishable from 0.
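The variance inflation factor (VIF) is equal to 11.792711, a value consistent with 1/(1 - R^2) for the full model above (R^2 = 0.9152). This is well above the thresholds commonly used as rules of thumb (values between 2 and 10 are cited), suggesting a strong impact of multicollinearity in this dataset. Because a VIF is conventionally computed for each predictor by regressing it on the remaining predictors, this single value should be read as a rough indicator only. Recall that the linear model assumes the absence of perfect multicollinearity, and that strong multicollinearity inflates the variance of the coefficient estimates.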
It is not surprising that many components are inter-correlated. For example, silicon and calcium have a correlation of 0.9726204. The practical explanation is that both silicon and calcium are “crustal elements”, meaning that they predominantly enter PM2.5 from soil and dust resuspended by natural and human activities. Such correlations are very informative in environmental science research but can seriously harm the linear regression model investigated here.
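To illustrate how correlated components inflate the variance of coefficient estimates, the following base R sketch computes a VIF for each predictor as 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on all the others. The helper `vif_base` and the synthetic data are assumptions introduced for illustration only.

```r
# Per-predictor VIF: regress each column on the remaining columns.
vif_base <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    f <- reformulate(setdiff(names(X), v), response = v)
    r2 <- summary(lm(f, data = X))$r.squared
    1 / (1 - r2)
  })
}

set.seed(42)
n <- 200
dust <- rnorm(n)  # shared "crustal" source driving two elements

# Silicon and Calcium share a source; Sulfate is independent of both.
X <- data.frame(
  Silicon = dust + rnorm(n, sd = 0.2),
  Calcium = dust + rnorm(n, sd = 0.2),
  Sulfate = rnorm(n)
)

vifs <- vif_base(X)
print(round(vifs, 2))
print(round(cor(X$Silicon, X$Calcium), 3))
```

In this sketch the two predictors that share a source receive large VIFs, while the independent predictor stays close to 1.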
With cross-validation, principal component regression showed that 5 principal components were able to explain over 80% of the variation in PM2.5. Increasing the number of components only marginally increases R2 and reduces the mean squared error of prediction (MSEP). Therefore, for the purpose of dimension reduction, the regression model with 5 components was selected for further evaluation in this project.
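The idea behind principal component regression can be sketched in base R with `prcomp()` followed by `lm()` on the leading component scores. This is a minimal illustration on synthetic data with two underlying sources; the number of components, the variable names and the data are assumptions, and cross-validation is omitted for brevity.

```r
set.seed(7)
n <- 150

# Two latent "sources", each driving four correlated measured components.
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(sapply(1:4, function(i) f1 + rnorm(n, sd = 0.3)),
           sapply(1:4, function(i) f2 + rnorm(n, sd = 0.3)))
colnames(X) <- paste0("C", 1:8)
y <- 2 + 3 * f1 - 2 * f2 + rnorm(n, sd = 0.5)

# Step 1: principal components of the standardized predictors.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Step 2: regress the response on the first k component scores.
k <- 2
pcr_data <- data.frame(y = y, pca$x[, 1:k, drop = FALSE])
pcr_fit <- lm(y ~ ., data = pcr_data)
r2_k <- summary(pcr_fit)$r.squared

print(round(var_explained, 3))
print(round(r2_k, 3))
```

Because the component scores are uncorrelated by construction, this regression avoids the multicollinearity problem, at the cost of coefficients that no longer map directly onto individual chemical components.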
## Antimony. Arsenic. Aluminum. Barium. Bromine. Cadmium.
## -0.02954041 0.15975805 0.50783592 0.08421147 0.52201557 -0.13195551
## Calcium. Chromium. Cobalt. Copper. Chlorine. Cerium.
## -0.01166957 -0.05574703 -0.02294926 0.37928667 0.03053707 -0.31483617
## Cesium. Iron. Lead. Indium. Manganese. Nickel.
## -0.24889491 0.29354575 0.17326462 0.04186939 0.32290633 0.19987074
## Magnesium. Phosphorus. Selenium. Tin. Titanium. Vanadium.
## -0.05167313 -0.32315997 0.20791993 0.01516300 0.21623966 0.60070544
## Silicon. Silver. Zinc. Strontium. Rubidium. Zirconium.
## 0.09264434 -0.12441157 -0.07383630 0.08561231 0.21078614 0.16132446
## Ammonium Sodium Potassium Nitrate OC EC
## 0.72988855 0.03843268 0.45430156 0.39156287 0.83763790 0.75335539
## Sulfate.
## 0.84164923
## [1] 1.5905
## 38 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.2327931
## Antimony. -0.3693021
## Arsenic. 146.8439285
## Aluminum. -1.9840391
## Barium. 72.6500479
## Bromine. 106.0231000
## Cadmium. -6.2525330
## Calcium. 2.2455402
## Chromium. -48.6316988
## Cobalt. -291.6304553
## Copper. 37.8671311
## Chlorine. 2.3190229
## Cerium. -392.6371290
## Cesium. 50.8906724
## Iron. -1.2172543
## Lead. -19.9090654
## Indium. -12.7458894
## Manganese. 215.6197469
## Nickel. -10.6526738
## Magnesium. -4.6040578
## Phosphorus. -6.2305396
## Selenium. 130.8275639
## Tin. -17.1925657
## Titanium. -20.8652995
## Vanadium. -26.7817509
## Silicon. 2.4149033
## Silver. -49.3633590
## Zinc. 8.6812635
## Strontium. 56.8969152
## Rubidium. 286.5269589
## Zirconium. 374.1075373
## Ammonium 1.3369798
## Sodium -0.5227367
## Potassium 3.4606923
## Nitrate 0.3988884
## OC 1.1647456
## EC 1.5981218
## Sulfate. 0.9270153
The coefficients of the ridge regression model show that many variables have negative and/or very large coefficients. As discussed in the previous section, negative coefficients are not physically meaningful for component concentrations. In addition, large coefficients indicate that the model places considerable weight on the corresponding variables, which may be alarming in some cases (especially for trace elements).
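To show why ridge regression shrinks coefficients without setting them to zero or forcing them to be positive, here is a closed-form sketch in base R. The helper `ridge_fit` and its penalty scaling are assumptions introduced for illustration; packaged implementations scale the penalty differently, so the lambda values are not comparable with the one printed above.

```r
# Closed-form ridge on standardized predictors, reported on the original scale.
ridge_fit <- function(X, y, lambda) {
  Xs <- scale(X)                      # center and scale each predictor
  yc <- y - mean(y)                   # center the response
  p <- ncol(Xs)
  beta_s <- solve(crossprod(Xs) + lambda * diag(p), crossprod(Xs, yc))
  beta <- drop(beta_s) / attr(Xs, "scaled:scale")
  intercept <- mean(y) - sum(beta * attr(Xs, "scaled:center"))
  c("(Intercept)" = intercept, setNames(beta, colnames(X)))
}

set.seed(3)
n <- 100
z <- rnorm(n)

# Two nearly collinear predictors plus one independent predictor.
X <- cbind(Silicon = z + rnorm(n, sd = 0.1),
           Calcium = z + rnorm(n, sd = 0.1),
           OC      = rnorm(n))
y <- 1 + X[, "Silicon"] + X[, "Calcium"] + 2 * X[, "OC"] + rnorm(n)

ols <- coef(lm(y ~ X))          # ordinary least squares for reference
r0  <- ridge_fit(X, y, 0)       # lambda = 0 reproduces least squares
r50 <- ridge_fit(X, y, 50)      # a larger penalty shrinks the slopes

print(round(rbind(ols = unname(ols), ridge0 = unname(r0), ridge50 = unname(r50)), 3))
```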
As we have seen, many components are inter-correlated, at least to some extent, which may prevent effective modeling with multiple regression. Lasso regression can aggressively reduce dimensionality by shrinking many of the coefficients to exactly zero, which may be helpful when many variables are frequently zero or close to zero.
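The mechanism that lets Lasso set coefficients exactly to zero is soft-thresholding. The following base R sketch implements Lasso by cyclic coordinate descent for the objective (1/(2n)) * RSS + lambda * sum(|beta|) on standardized predictors. The helpers `soft` and `lasso_cd`, the fixed iteration count and the synthetic data are assumptions for illustration, not the implementation behind the output shown here.

```r
# Soft-thresholding operator: shrinks z toward zero and clips small values to 0.
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Lasso by cyclic coordinate descent; returns coefficients on the standardized scale.
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  Xs <- scale(X)
  yc <- y - mean(y)
  n <- nrow(Xs)
  p <- ncol(Xs)
  beta <- numeric(p)
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      # Partial residual: remove the fit of all predictors except j.
      r_j <- yc - drop(Xs[, -j, drop = FALSE] %*% beta[-j])
      rho <- sum(Xs[, j] * r_j) / n
      beta[j] <- soft(rho, lambda) / (sum(Xs[, j]^2) / n)
    }
  }
  setNames(beta, colnames(X))
}

set.seed(11)
n <- 200

# Two informative predictors (X1, X2) and five pure-noise predictors.
X <- matrix(rnorm(n * 7), n, 7, dimnames = list(NULL, paste0("X", 1:7)))
y <- 2 * X[, 1] + 1.5 * X[, 2] + rnorm(n)

b <- lasso_cd(X, y, lambda = 0.3)
print(round(b, 3))
```

Predictors whose correlation with the partial residual stays below lambda are clipped to exactly zero, which is how the penalty performs variable selection.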
## [1] 0.3195685
## 38 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.07442779
## Antimony. .
## Arsenic. .
## Aluminum. .
## Barium. .
## Bromine. .
## Cadmium. .
## Calcium. 3.04636341
## Chromium. .
## Cobalt. .
## Copper. .
## Chlorine. 0.56430824
## Cerium. .
## Cesium. .
## Iron. .
## Lead. .
## Indium. .
## Manganese. 121.55395982
## Nickel. .
## Magnesium. .
## Phosphorus. .
## Selenium. .
## Tin. .
## Titanium. .
## Vanadium. .
## Silicon. .
## Silver. .
## Zinc. .
## Strontium. .
## Rubidium. .
## Zirconium. 202.06559077
## Ammonium 2.41793184
## Sodium .
## Potassium .
## Nitrate 0.02448812
## OC 1.51867150
## EC 1.55921202
## Sulfate. 0.63219105
The results from the Lasso regression suggest that only 9 (out of 37) variables (components) end up having non-zero coefficients.
| model | RMSE | R^2 |
|---|---|---|
| lasso | 1.74 | 0.8753 |
| ridge | 1.979 | 0.8321 |
| subsetting | 2.122 | 0.8239 |
| lm | 2.308 | 0.792 |
| pcr | 2.391 | 0.7506 |
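The two metrics in the comparison table can be computed from observed and predicted test-set values as sketched below. The R^2 definition used here (one minus the ratio of residual to total sum of squares) is an assumption; the report may instead have used the squared correlation between observed and predicted values. The numbers are made up for illustration.

```r
# Root mean squared error of the predictions.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# R^2 as 1 - SSE/SST, evaluated on held-out data.
r2 <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Hypothetical observed and predicted PM2.5 values.
obs  <- c(10, 12, 14, 16)
pred <- c(11, 12, 13, 17)

rmse_val <- rmse(obs, pred)
r2_val <- r2(obs, pred)
print(c(RMSE = rmse_val, R2 = r2_val))
```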
Based on these results, Lasso regression yielded the best fit and prediction, while the principal component regression model had the largest RMSE. In general, all five evaluated models showed decent to good predictive capability on the testing dataset. A closer look at the coefficients generated by the Lasso regression shows that the components that account for a large portion of the PM2.5 mass, as well as those commonly correlated with other components, were retained. For example, sulfate, nitrate, OC, EC and ammonium account for about three quarters of the PM2.5 mass. Chlorine is correlated with several sea salt elements (sodium, magnesium, etc.) and calcium is correlated with mineral species such as silicon and iron. To some extent, we can even identify some major sources of PM2.5 from this shorter list of variables.
Considering the additional benefit of dimension reduction (9 variables in the final model), Lasso regression is the preferred choice. The Lasso coefficients for manganese and zirconium are large; this might be because their concentrations are close to zero, which leads to larger uncertainties.